Conversation

scotwells (Contributor) commented Jan 9, 2026

Summary

Optimize Clickhouse database schema for platform-wide and user-specific querying of multi-tenant audit log data.

Details

As we began running performance tests against the activity apiserver, we noticed that platform-wide queries were performing drastically worse than tenant-level queries. See datum-cloud/enhancements#536 (comment) for a comparison.

This was a result of our initial schema ordering data by tenant, which meant platform-wide queries had to scan the entire data set instead of skipping over irrelevant rows.

Performance improving changes

  • Moved to daily partitions so that data is TTL'd at a finer granularity and expired partitions are dropped sooner.
  • Removed unnecessary skip indexes on fields already present in the table's sort key; a skip index adds little when the sort key already covers the field.
  • Added skip indexes for fields used in common query patterns to help skip over irrelevant rows.
  • Created new projections designed to efficiently query audit logs across all tenants.
    • The platform-wide projection supports platform administrators querying across all tenants. Queries are most performant when they filter by a specific API group and resource, which will be the most common pattern for cross-tenant queries.
    • Also introduced a projection for user-specific queries to help platform administrators find all audit logs related to a specific user.
  • Moved to hour-bucketed timestamps as the primary key's leading column to improve data locality, compression, and query performance. A sketch of the resulting schema is shown after this list.
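
To make the shape of these changes concrete, here is a minimal sketch of what the resulting table definition could look like. The column list, index types, projection names, retention window, and setting value are illustrative assumptions rather than the exact contents of the actual migration; only the daily partitioning, hour-bucketed sort key, skip indexes, and projection structure reflect the bullets above, and the ReplacingMergeTree engine and its merge setting come from the items further down.

```sql
-- Illustrative sketch only; column names, index types, projection names, and
-- the TTL window are assumptions, not the actual 001_initial_schema.sql.
CREATE TABLE audit_logs
(
    timestamp  DateTime64(3),
    scope_type LowCardinality(String),
    scope_name String,
    user       String,
    audit_id   UUID,
    api_group  LowCardinality(String),
    resource   LowCardinality(String),

    -- Skip indexes only on fields that are not already early in the sort key.
    INDEX idx_api_group api_group TYPE set(100)  GRANULARITY 4,
    INDEX idx_resource  resource  TYPE set(1000) GRANULARITY 4,

    -- Platform-wide projection: leads with api_group/resource so cross-tenant
    -- queries that filter on them can skip most rows.
    PROJECTION platform_wide
    (
        SELECT * ORDER BY (toStartOfHour(timestamp), api_group, resource, audit_id)
    ),

    -- User-centric projection for "all activity by this user" queries.
    PROJECTION by_user
    (
        SELECT * ORDER BY (toStartOfHour(timestamp), user, audit_id)
    )
)
ENGINE = ReplacingMergeTree
PARTITION BY toDate(timestamp)  -- daily partitions, expired per day by the TTL
ORDER BY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id, timestamp)
PRIMARY KEY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id)
TTL toDateTime(timestamp) + INTERVAL 30 DAY  -- retention window is a guess
-- Recent ClickHouse versions require an explicit merge mode to combine
-- projections with ReplacingMergeTree (see the merge-behavior commit below).
SETTINGS deduplicate_merge_projection_mode = 'rebuild';
```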

I've modified the 001_initial_schema.sql migration instead of adding a new migration because this service has not been released yet.

Related changes

  • Removed stage from the schema and the querying interface since we're only collecting the ResponseComplete stage from the system.
  • Adjusted the apiserver to change the ORDER BY used when querying ClickHouse so that the appropriate projection is used for the query being performed by the end user (see the query sketch after this list).
  • Updated the performance tests to better reflect real-world querying behavior where the API group / resource are present in the queries.
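
As a rough illustration of that ORDER BY adjustment, here are the two query shapes side by side, written against the sketched schema above; the filter values and exact SQL are placeholders, not the apiserver's generated queries.

```sql
-- Tenant-scoped listing: filters and ordering line up with the base sort key.
SELECT *
FROM audit_logs
WHERE timestamp >= now() - INTERVAL 24 HOUR
  AND scope_type = 'project'          -- placeholder scope type
  AND scope_name = 'example-project'  -- placeholder scope name
ORDER BY toStartOfHour(timestamp), scope_type, scope_name, user, audit_id, timestamp
LIMIT 100;

-- Platform-wide listing: filtering on API group/resource and ordering to match
-- the platform-wide projection's sort key, so the query can be served from the
-- projection instead of scanning every tenant's rows. The apiserver may flip
-- the direction for newest-first pagination.
SELECT *
FROM audit_logs
WHERE timestamp >= now() - INTERVAL 24 HOUR
  AND api_group = 'apps'         -- placeholder API group
  AND resource  = 'deployments'  -- placeholder resource
ORDER BY toStartOfHour(timestamp), api_group, resource, audit_id
LIMIT 100;
```

On recent ClickHouse versions, the `projections` column in `system.query_log` can be used to confirm which projection actually served a query.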

Unrelated changes

  • Upgraded to the v1.9.0 version of our shared actions to resolve an issue with the wrong tag being injected into the kustomize builds.
  • Moved to the ReplacingMergeTree table engine to ensure that all audit logs are unique. Removing duplicates is a background operation, so users may see duplicates if a merge operation hasn't been performed yet. To help prevent duplicates, I adjusted the NATS and Vector configurations to de-duplicate audit logs based on the audit ID, which is guaranteed to be unique since we only collect the ResponseComplete stage. A sketch of query-time deduplication options follows this list.
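
For completeness, here is a sketch of how duplicates that haven't been merged away yet could be handled at query time in ClickHouse; this is illustrative and not necessarily what the activity apiserver does.

```sql
-- 1. Apply merge (deduplication) semantics at read time; exact but more
--    expensive than a plain SELECT.
SELECT *
FROM audit_logs FINAL
WHERE timestamp >= now() - INTERVAL 1 HOUR;

-- 2. Deduplicate explicitly on the audit ID, which is unique because only the
--    ResponseComplete stage is collected.
SELECT
    audit_id,
    any(user)      AS user,
    min(timestamp) AS timestamp
FROM audit_logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY audit_id;
```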

Performance test results

These performance tests were executed against a 3-node ClickHouse cluster, with each node allocated 8 CPUs and 30Gi of RAM. The performance test ramps up RPS against the system while querying 24 hours of audit logs. We're collecting roughly 24M audit logs per day.

Previous Clickhouse schema

This shows a performance test run against the activity system while it was focused on tenant-level querying. The graphs show that the activity API would struggle with even a small number of platform-level queries (~4 RPS), with queries immediately beginning to time out.

[image: performance test graphs for the previous schema]

New optimized Clickhouse schema

This performance test demonstrates the improvements after the new schema was applied. We were able to top out at 96.5 RPS with a P99 response time of 2.5 seconds, and no timeouts or errors occurred during the test.

[image: performance test graphs for the optimized schema]

Resources


Relates to datum-cloud/enhancements#536

As we began running performance tests against the activity apiserver, we
noticed that platform-wide queries were performing drastically worse
than tenant-level queries.

This was a result of our initial schema ordering data by tenant, which
meant platform-wide queries had to scan the entire data set instead of
skipping over irrelevant rows.

This change makes several adjustments to the schema to improve querying
performance of the clickhouse database.

- Moved to daily partitions so that partitions can be TTL'd each day
  instead of only when the month is over. This should also mean queries
  scan fewer partitions because all queries will be time-bound.
- Removes unnecessary skip indexes on fields already present in the
  ordering of the data. A skip index provides little benefit when the
  sort key already covers the field.
- Moves to using a ReplacingMergeTree database engine to ensure that
  all audit logs are unique. Removing duplicates is a background
  operation so users _may_ see duplicates if a merge operation hasn't
  been performed. We'll mitigate this in the collection pipeline by
  putting guardrails in place to prevent duplicates from being sent to
  Clickhouse.
- Adds indexes for fields used for common querying patterns to help skip
  over irrelevant rows.
- Creates new projections that are designed to efficiently query audit
  logs across all tenants. The platform-wide query projection is
  designed to support platform administrators querying across all
  tenants. Queries will be most performant when they filter by a
  specific API group and resource, which will be the most common
  querying pattern for cross-tenant queries. Also introduces a query
  projection for user-specific queries to help platform administrators
  query for all audit logs related to a specific user.

I've modified the 001_initial_schema.sql migration instead of adding a
new migration because this service has not been released yet.

I've also removed `stage` from the schema and the querying interface
since we're only collecting the `ResponseComplete` stage from the
system.

The apiserver has also been adjusted to change the ORDER BY used when
querying ClickHouse so that the appropriate projection is used for the
query being performed by the end user.

Lastly, the performance tests have been updated to better reflect
real-world querying behavior where the api group / resource are present
in the queries.

We need to configure the merge behavior of projections since we swapped
over to the ReplacingMergeTree engine.
See: https://clickhouse.com/docs/data-modeling/projections
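
For context, recent ClickHouse versions refuse to keep projections on a ReplacingMergeTree table unless the projection merge behavior is set explicitly; a sketch of the relevant table setting, with the value chosen here being an assumption:

```sql
-- 'rebuild' recomputes projection parts during deduplicating merges so they
-- stay consistent with the base data; 'drop' is cheaper but discards the
-- projection part for merged ranges; the default 'throw' rejects the
-- combination outright. The value used by the actual migration is not shown
-- here, so 'rebuild' is only an assumption.
ALTER TABLE audit_logs
    MODIFY SETTING deduplicate_merge_projection_mode = 'rebuild';
```
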
This change adjusts the NATS stream configuration to support a 10-minute
de-duplication window. The NATS message ID has been set to the audit log
ID since the ID will be unique across all audit logs.
scotwells force-pushed the fix/improve-platform-wide-query-performance branch from e63a8e6 to 21ce2cf on January 9, 2026 22:36
Have to enable JetStream to take advantage of the message_id option.
scotwells force-pushed the fix/improve-platform-wide-query-performance branch from 21ce2cf to a5b914a on January 9, 2026 22:40
Comment on lines 85 to 87
-- Primary key optimized for tenant-scoped queries
-- Deduplication occurs on the full ORDER BY key during merges
ORDER BY (timestamp, scope_type, scope_name, user, audit_id)
scotwells (author) commented:
Talked with @drrev about this ClickHouse schema and he mentioned we should try the following schema change and see if there are any additional performance gains.

ORDER BY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id, timestamp)
PRIMARY KEY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id)

Maybe even use toStartOfMinute depending on the volume of data.

scotwells (author) commented:

I ran a performance test against an updated schema that uses toStartOfHour() in the ORDER BY / primary key to group events into 1-hour buckets.

The performance test completed successfully with no errors.

[image: performance test graphs for the hour-bucketed schema]

We were able to top out at 96.5 RPS with a P99 response time of 2.5 seconds.


Pushed a change in a6ce224 that adopts the recommended schema changes and the apiserver pagination implementation.

Thanks @drrev for the recommendation!

drewr (Contributor) commented Jan 12, 2026

How big was the dataset in the tests? For Clickhouse it seems like 60 RPS, especially for reads, is several orders of magnitude below what it should be, even with billions of documents (which I'm assuming we didn't have...).

scotwells (author) replied:
@drewr the system had around 77 million rows in it, spread across ~3 days. Most of the queries in the performance test would have been scanning the last 24 hours, so each query should have been looking at ~25 million rows.

The current schema uses millisecond-precision timestamps as the primary
key's leading column. For a multi-tenant audit logging system ingesting
events from many Kubernetes control planes, this causes:

- Poor data locality: Events from the same tenant arriving at slightly
  different times are scattered across granules
- Suboptimal compression: Similar events aren't co-located, reducing
  compression effectiveness
- Inefficient deduplication: Duplicate events within the same hour
  aren't consistently grouped
- Higher storage costs: More granules and lower compression ratios
  increase disk usage

Hour bucketing addresses these issues by clustering events from the same
tenant/scope/user within hour boundaries, aligning with how audit logs
are queried (typically 1h, 24h, or 7d time ranges).
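
One way to observe the compression and granule effects described above is to compare the active parts of the table before and after the schema change; a sketch, with the table name assumed:

```sql
-- Row counts, compressed vs. uncompressed bytes, and mark (granule) counts for
-- the active parts of the audit log table.
SELECT
    table,
    sum(rows)                    AS total_rows,
    sum(data_compressed_bytes)   AS compressed_bytes,
    sum(data_uncompressed_bytes) AS uncompressed_bytes,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio,
    sum(marks)                   AS marks
FROM system.parts
WHERE active AND table = 'audit_logs'
GROUP BY table;
```
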
ecv left a comment:
PR v.large, approving on trust

scotwells merged commit 5f92a71 into main on Jan 13, 2026
4 checks passed
scotwells deleted the fix/improve-platform-wide-query-performance branch on January 13, 2026 19:14
scotwells added a commit that referenced this pull request Jan 15, 2026
## Summary

This PR makes additional improvements to platform-wide querying
performance, adjusts the clickhouse audit log schema to use the correct
timestamp for the request, adds support for querying by the user's UID,
and adjusts the user-scoped projection to use the user's UID value
instead of the username.

## Details

- **Filter by User's UID** - Filtering by UID lets us narrow results to a
specific user using a stable identifier instead of an email address, which
the user can change. UIDs also only exist for users of the platform;
internal components that authenticate with certificates do not have UIDs,
which gives us a clean way of filtering internal components out of the
audit logs.
- **Request Received Timestamp** - I swapped to using the
`.requestReceivedTimestamp` field of the audit log to represent the
audit log's timestamp since it's the timestamp when the request was
received by the apiserver. The `.stageTimestamp` is used by the
collection pipeline to calculate delays in the pipeline because the
timestamp indicates when the audit log was generated by the apiserver.
- **User UID for user scope** - I swapped to using the user's UID as the
filtering / sorting column when querying the audit log system through
the user scope, since the UID is the stable identifier for the user and
is the value provided in the user's extra information (a rough sketch of
the reworked projection follows this list).
- **Hourly timestamp buckets** - Updated all projections to use the same
hourly time bucketing introduced in #23.
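
A hypothetical sketch of what the reworked user-scoped projection could look like; the `user_uid` column and projection name are assumptions, not the actual migration.

```sql
-- Assumed column user_uid and projection name by_user_uid; shown only to
-- illustrate keying the user scope on the stable UID instead of the username,
-- with the same hourly bucketing as the other projections.
ALTER TABLE audit_logs
    ADD PROJECTION by_user_uid
    (
        SELECT * ORDER BY (toStartOfHour(timestamp), user_uid, audit_id)
    );

-- Existing parts need an explicit materialization for the new projection.
ALTER TABLE audit_logs MATERIALIZE PROJECTION by_user_uid;
```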

---

Relates to datum-cloud/enhancements#536